We present here the most recent version of FermiQCD, a collection of C++ classes, functions and parallel
algorithms for lattice QCD, based on Matrix Distributed Processing. FermiQCD allows fast development of
parallel lattice applications and includes some SSE2 optimizations for clusters of Pentium 4 PCs.



1. Introduction 

FermiQCD is a collection of classes, functions
and parallel algorithms for lattice QCD, written
in C++. It is based on Matrix Distributed
Processing (MDP). The latter is a library
that includes C++ methods for matrix manipulation,
advanced statistical analysis (such as Jackknife
and Bootstrap) and optimized algorithms for
interprocess communications of distributed lattices
and fields. These communications are implemented
using the Message Passing Interface (MPI),
but MPI calls are hidden from the high-level algorithms
that constitute FermiQCD.
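
To make the Jackknife analysis mentioned above concrete, here is a minimal sketch of a jackknife estimate of a sample mean and its error. It is not taken from MDP: the function name jackknife_mean and its interface are assumptions made for this illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative helper (hypothetical name, not part of MDP): jackknife
// estimate of the mean of a sample and of its standard error.
// Returns {mean, error}; requires at least two measurements.
std::pair<double, double> jackknife_mean(const std::vector<double>& data) {
  const std::size_t n = data.size();
  double total = 0.0;
  for (double v : data) total += v;
  const double mean = total / n;

  // Leave-one-out averages: drop measurement i and average the rest.
  double spread = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    const double leave_one_out = (total - data[i]) / (n - 1);
    spread += (leave_one_out - mean) * (leave_one_out - mean);
  }
  // Jackknife variance: (n-1)/n * sum_i (theta_i - theta_bar)^2.
  const double error = std::sqrt(spread * (n - 1) / n);
  return {mean, error};
}
```

For a plain average the jackknife error coincides with the usual standard error of the mean; the method becomes useful for nonlinear functions of averages, where the leave-one-out samples are passed through the function before the spread is computed.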

FermiQCD also works on single-processor computers
and, in this case, MPI is not required.

2. FermiQCD overview 

The basic fields defined in FermiQCD are: 

class gauge_field:

List of implemented algorithms: 

• heatbath algorithm 

• anisotropic heatbath 

• $O(a^2)$ heatbath

These algorithms work for arbitrary gauge groups
$SU(N_c)$, for arbitrary lattice dimensions and
topologies. FermiQCD also supports arbitrarily
twisted boundary conditions for large $\beta$ computations
and studies of topology.
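
As an illustration of what "arbitrary lattice dimensions" involves at the level of site addressing, the sketch below maps coordinates to a linear index and finds neighbours with periodic wrap-around, the simplest topology. The functions site_index and neighbour are hypothetical names for this example and are not the MDP interface; twisted boundary conditions are not covered.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not the MDP API: lexicographic index of a site on
// a d-dimensional lattice with box sizes `box` (first coordinate slowest).
std::size_t site_index(const std::vector<int>& coords,
                       const std::vector<int>& box) {
  std::size_t index = 0;
  for (std::size_t mu = 0; mu < box.size(); ++mu)
    index = index * box[mu] + coords[mu];
  return index;
}

// Index of the neighbour of `coords` one step along direction mu
// (step = +1 or -1), assuming periodic boundary conditions.
std::size_t neighbour(std::vector<int> coords, const std::vector<int>& box,
                      std::size_t mu, int step) {
  coords[mu] = (coords[mu] + step + box[mu]) % box[mu];
  return site_index(coords, box);
}
```

Because the dimension is just the length of `box`, the same two functions serve a 2-dimensional test lattice and a 4-dimensional production lattice.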

class fermi_field:

List of implemented algorithms: 

• multiplication by $Q = (\slashed{D} + m)$ for Wilson
and Clover actions, for isotropic and
anisotropic lattices

• minimal residue inversion for Q 

• stabilized biconjugate gradient (BiCGStab) 
inversion for Q 

• Wupperthal smearing for the field 

• stochastic propagators 

These algorithms work for arbitrary gauge groups
$SU(N_c)$ and for arbitrary topologies in 4 dimensions.
The multiplication by Q for the clover action (isotropic
and anisotropic), for $SU(3)$, is optimized using
Pentium 4 SSE2 instructions in assembler language.
This implementation is based on the
assembler macro functions written by Martin
Lüscher.
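
The minimal residue inversion listed above can be sketched on a small dense real matrix standing in for Q. This is only an illustration of the iteration, under the assumption that the matrix has a positive definite symmetric part; the type aliases and the function minimal_residue are hypothetical and do not reproduce the FermiQCD implementation, which acts on complex lattice fields.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Dense matrix-vector product, a stand-in for "multiplication by Q".
static Vec apply(const Mat& A, const Vec& v) {
  Vec out(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < v.size(); ++j) out[i] += A[i][j] * v[j];
  return out;
}

static double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Minimal residue iteration for A x = b, starting from the x passed in.
// Returns the number of iterations used, or -1 if not converged.
int minimal_residue(const Mat& A, const Vec& b, Vec& x,
                    double tol = 1e-10, int max_iter = 1000) {
  Vec r = b;
  const Vec Ax = apply(A, x);
  for (std::size_t i = 0; i < r.size(); ++i) r[i] -= Ax[i];
  const double b_norm = std::sqrt(dot(b, b));
  for (int iter = 0; iter < max_iter; ++iter) {
    if (std::sqrt(dot(r, r)) <= tol * b_norm) return iter;
    const Vec q = apply(A, r);
    // Step length that minimizes |r - alpha * A r|.
    const double alpha = dot(q, r) / dot(q, q);
    for (std::size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * r[i];
      r[i] -= alpha * q[i];
    }
  }
  return std::sqrt(dot(r, r)) <= tol * b_norm ? max_iter : -1;
}
```

Each iteration costs one matrix-vector product, which is why the speed of the multiplication by Q dominates the cost of an inversion.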

class fermi_propagator:

This is an implementation of ordinary quark
propagators. A fermi_propagator can be generated
using any of the inversion algorithms of a
fermi_field.

class staggered_field:

Kogut-Susskind (KS) fermion. List of implemented
algorithms:

• multiplication by Q, for unimproved and
$O(a^2)$ (Asqtad) improved actions

• BiCGStab inversion for Q 

• BiCGStab inversion for Q using the UML 
decomposition 

• function make_meson

These algorithms work for arbitrary gauge groups
$SU(N_c)$ and for an arbitrary even number of dimensions
(except make_meson). The multiplication
by Q, both improved and unimproved, for
$SU(3)$, is optimized using Pentium 4 SSE2 instructions
in assembler language. In the unimproved
case only half of the SSE2 registers are
used and there is room for an extra factor of two in
speed. The function make_meson builds any meson
propagator (made out of staggered quarks)
for arbitrary Spin$\otimes$Flavour structure. This algorithm
is described in ref. [?].
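
The BiCGStab inversion listed above follows the standard stabilized biconjugate gradient recurrence. The sketch below applies the unpreconditioned textbook form to a small dense real matrix. It is an assumption-laden illustration, not the FermiQCD code: the names are hypothetical, the arithmetic is real rather than complex, and the UML decomposition is not included.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static Vec apply(const Mat& A, const Vec& v) {
  Vec out(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < v.size(); ++j) out[i] += A[i][j] * v[j];
  return out;
}

static double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Unpreconditioned BiCGStab for A x = b, starting from the x passed in.
// Returns the number of iterations used, or -1 on breakdown or failure.
int bicgstab(const Mat& A, const Vec& b, Vec& x,
             double tol = 1e-10, int max_iter = 1000) {
  const std::size_t n = b.size();
  Vec r = b;
  const Vec Ax = apply(A, x);
  for (std::size_t i = 0; i < n; ++i) r[i] -= Ax[i];
  const Vec r_hat = r;  // fixed "shadow" residual
  Vec p(n, 0.0), v(n, 0.0), s(n, 0.0);
  double rho = 1.0, alpha = 1.0, omega = 1.0;
  const double b_norm = std::sqrt(dot(b, b));
  for (int iter = 0; iter < max_iter; ++iter) {
    if (std::sqrt(dot(r, r)) <= tol * b_norm) return iter;
    const double rho_new = dot(r_hat, r);
    if (rho_new == 0.0 || omega == 0.0) return -1;  // breakdown
    const double beta = (rho_new / rho) * (alpha / omega);
    for (std::size_t i = 0; i < n; ++i)
      p[i] = r[i] + beta * (p[i] - omega * v[i]);
    v = apply(A, p);
    const double denom = dot(r_hat, v);
    if (denom == 0.0) return -1;  // breakdown
    alpha = rho_new / denom;
    for (std::size_t i = 0; i < n; ++i) s[i] = r[i] - alpha * v[i];
    const Vec t = apply(A, s);
    const double tt = dot(t, t);
    omega = tt > 0.0 ? dot(t, s) / tt : 0.0;  // stabilizing step
    for (std::size_t i = 0; i < n; ++i) {
      x[i] += alpha * p[i] + omega * s[i];
      r[i] = s[i] - omega * t[i];
    }
    rho = rho_new;
  }
  return std::sqrt(dot(r, r)) <= tol * b_norm ? max_iter : -1;
}
```

Each iteration needs two matrix-vector products, against one for the minimal residue method, so the choice between the two is a trade-off between cost per iteration and number of iterations.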

class staggered_propagator: 

This is an implementation of the staggered
propagator consisting of 16 sources contained in
a $2^4$ hypercube at the origin of the lattice. A
staggered_propagator can be used to propagate
any hadron from the hypercube at the origin of
the lattice to any other hypercube without extra
inversions.
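
The count of 16 sources is simply the number of corners of a $2^4$ hypercube. The following sketch enumerates them; the function name hypercube_corners and the bit-to-direction convention are assumptions made for this illustration and say nothing about how staggered_propagator orders its sources internally.

```cpp
#include <array>
#include <vector>

// Illustrative only: list the 2^4 = 16 corners of the unit hypercube at
// the origin of a 4-dimensional lattice. Bit mu of `corner` is taken as
// the coordinate along direction mu (a convention chosen for this sketch).
std::vector<std::array<int, 4>> hypercube_corners() {
  std::vector<std::array<int, 4>> corners;
  for (int corner = 0; corner < 16; ++corner) {
    std::array<int, 4> x;
    for (int mu = 0; mu < 4; ++mu) x[mu] = (corner >> mu) & 1;
    corners.push_back(x);
  }
  return corners;
}
```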

All fields in FermiQCD inherit the standard
I/O methods of MDP (save and load) and the
file format is independent of the lattice partitioning
over the parallel processes. These I/O functions,
as well as all the FermiQCD algorithms,
are designed to optimize interprocess communications.

3. Example 

We present here, as an example, a full program
that generates 100 $SU(3)$ gauge configurations
(U), starting from a hot one. On each configuration
it computes a pion propagator (pion) made
of $O(a^2)$ improved quark propagators and prints
it out. These propagators are computed using the
SSE2 optimized clover action and the BiCGStab
inversion algorithm. The program works in parallel.



#define PARALLEL
#include "fermiqcd.h"

int main(int argc, char **argv) {
  mpi.open_wormholes(argc, argv);        // start parallel communications
  int t, a, b, conf;
  int nc = 3, box[4] = {16, 8, 8, 8};    // SU(3) on a 16 x 8^3 lattice
  generic_lattice L(4, box);
  gauge_field U(L, nc);
  fermi_propagator S(L, nc);
  site x(L);
  float pion[16];
  U.param.beta = 5.7;
  S.param.kappa = 0.1345;
  S.param.cSW = 1.5;
  default_fermi_action = mul_Q_Luscher;
  default_inversion_method = BiCGStab_inversion;
  set_hot(U);                            // hot start
  heatbath(U, 100);                      // thermalization
  for (conf = 0; conf < 100; conf++) {
    heatbath(U, 30);                     // next configuration
    compute_em_field(U);
    generate(S, U);                      // quark propagator on U
    for (t = 0; t < 16; t++) pion[t] = 0;
    forallsites(x)
      for (a = 0; a < 4; a++)
        for (b = 0; b < 4; b++)
          pion[x(TIME)] +=
            real(trace(S(x,a,b) * hermitian(S(x,b,a))));
    mpi.add(pion, 16);                   // sum over all processes
    if (ME == 0) for (t = 0; t < 16; t++)
      printf("%i %f\n", t, pion[t]);
  }
  mpi.close_wormholes();
  return 0;
}



Comments: 

• L is the user-defined name of the lattice
($16 \times 8^3$)

Figure 1. Time per site in μsec for mul_Q_Luscher
(Wilson, clover, KS and Asqtad KS in single precision).

Figure 2. Time per site in μsec for mul_Q_Luscher
(Wilson, clover, KS and Asqtad KS in double precision).



• U, S and x are the gauge field, the fermi 
propagator and an auxiliary site variable 
defined on the lattice L 

• default_fermi_action is a pointer to the
function that implements the action to be
used. mul_Q_Luscher is one of the built-in
clover actions, optimized for Pentium 4.

• default_inversion_method is a pointer to
the function that implements the inversion
algorithm (minimal residue or BiCGStab).

• compute_em_field computes the chromo-electro-magnetic
field required by the action (*).

• generate computes the quark propagator S
on the given gauge configuration U.

• forallsites(x) is a parallel loop on x.
Each processor loops on the local sites.

• mpi.add(pion,16) sums the vector
pion[16] in parallel.

• if(ME==0) guarantees that only one processor
performs the output.

(*) The chromo-electro-magnetic field is a member variable
of the gauge field. FermiQCD has almost no global variables
except pointers to the functions that implement the
algorithms.
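
The parallel sum in the comments above can be pictured without MPI: each process holds a partial pion vector built from its local sites, and after the sum every process holds the element-wise total. The sketch below emulates this in a single process. The name global_add is hypothetical and is not the MDP interface; a real MPI program would typically obtain the same result with MPI_Allreduce and MPI_SUM, though whether MDP does so internally is not stated here.

```cpp
#include <cstddef>
#include <vector>

// Single-process emulation of a global vector sum: per_process[p] is the
// partial vector held by process p. On return every process holds the
// element-wise sum of all partial vectors.
void global_add(std::vector<std::vector<double>>& per_process) {
  if (per_process.empty()) return;
  std::vector<double> total(per_process[0].size(), 0.0);
  for (const auto& local : per_process)
    for (std::size_t t = 0; t < total.size(); ++t) total[t] += local[t];
  for (auto& local : per_process) local = total;
}
```

Since every process ends up with the same vector, restricting the output to one process, as the ME==0 test does, avoids printing the same result once per process.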

4. Benchmarks 

In fig. 1 and fig. 2 we report some benchmarks
for the multiplication by Q of a fermionic field,
for the different actions (using a single-CPU Pentium 4
PC running at 1.4 GHz, Linux 2.4 and gcc
2.95.3).
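
A time-per-site figure of this kind can be obtained by timing repeated sweeps of a kernel over the lattice and dividing by the number of site visits. The helper below is a minimal sketch of such a measurement using standard C++ timing facilities. It is not the benchmark code used for the figures, and the name microseconds_per_site is hypothetical.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical benchmark helper: average wall-clock time per lattice
// site, in microseconds, of a kernel that is called once per site.
template <typename Kernel>
double microseconds_per_site(Kernel kernel, std::size_t n_sites,
                             int repetitions) {
  const auto start = std::chrono::steady_clock::now();
  for (int rep = 0; rep < repetitions; ++rep)
    for (std::size_t site = 0; site < n_sites; ++site) kernel(site);
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::micro> elapsed = stop - start;
  return elapsed.count() / (static_cast<double>(n_sites) * repetitions);
}
```

Averaging over several repetitions reduces the influence of timer resolution; for a realistic measurement the kernel must produce a result the compiler cannot optimize away.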

This work was performed at Fermilab (a U.S. Department
of Energy laboratory operated by the Universities
Research Association, Inc.), under contract
DE-AC02-76CH03000.